In this paper we try to model certain features of human language complexity by means of advanced
concepts borrowed from statistical mechanics. We use a time series approach, the diffusion entropy
(DE) method, to compute the complexity of an Italian corpus of newspapers and magazines. We find
that the anomalous scaling index is compatible with a simple dynamical model, a random walk on
a complex scale-free network, which is linguistically related to Saussure's paradigms. The network
complexity is independently measured on the same corpus, looking at the co-occurrence of nouns
and verbs. This connection of cognitive complexity with long-range time correlations also provides
an explanation for the famous Zipf's law in terms of the generalized central limit theorem.

I. INTRODUCTION 

This introduction is divided into two parts. The first is devoted to a general discussion of the challenging
issue of the search for crucial events. The second is of linguistic interest. This division reflects the
twofold purpose of this paper. In fact, the main result of this paper is the foundation of Zipf's law, a subject of
theoretical interest for understanding the origin and evolution of language [1]. The foundation of this important
empirical rule is discussed here from a special perspective, that of complexity conceived as a new condition of
physics. This condition is expressed neither by regular dynamics nor by thermodynamics. It is rather a regime of
transition from dynamics to thermodynamics lasting for a virtually infinite time, so as to realize a new legitimate
state of physics, affording a perspective valid also in the case of non-physical systems.
A paradigm for this condition is given by the processes of renewal theory [2]. Thus, we make use of renewal theory
to define the concept of crucial events.

In conclusion, the first purpose of this paper is to contribute to the progress of a statistical method for the search
of crucial events, using written texts as an illustration of the method. On this important issue we do not reach
conclusive results, but only the formulation of a conjecture for future work. However, we note that the linguistic results
rest on a general approach to the detection of rare and significant events. This general approach to the detection of 
crucial events might yield useful applications to many fields, from the early diagnosis of diseases to the war against 
terrorism. Thus, we devote the first part of the introduction to the definition of crucial events. The second part 
of the introduction serves the purpose of introducing the reader to the ideas and jargon of linguistics, so as to fully 
appreciate the main result of this paper, which is, in fact, of linguistic interest. 

A. Definition of crucial events 

Let us consider a generic time series. This is a sequence of values occurring at different times and mirroring the
properties of a given complex system, such as heartbeats [3], seismic movements [4], or a written text, as in this
paper. Let us label, or mark, some of these values with a given criterion, which depends, of course, on the
conjectures that we make about the process under study. Labeling a value implies that we judge it to be significant.
Thus, it is evident that the more we know about the process, the more plausible the conjecture behind the labeling
criterion adopted. To give some illustrative examples, let us mention some of the criteria recently adopted. In the
case of heartbeats [3], we observe the time distance T between one R peak and the next, which for a very
large number of beats t becomes equivalent to a continuous function, T(t), of a continuous variable t. We divide the
ordinate axis into small intervals of size s, and we record, or mark, the beats at which T(t) moves from one strip to
another. In the case of earthquakes, we label only the seismic fluctuations of intensity larger than a given threshold.
Finally, in the case of a written text, as explained later in this paper, we mark the salient words, these being the
words that in a given text appear with a frequency larger than in a suitable reference text.

Now we have to address the important issue of establishing whether or not the labeled values are crucial events.
This, in turn, requires that we define what we mean by crucial event. Let us first define crucial events by means
of their statistical properties. Then we shall explain why events with these statistical properties must be considered
crucial. We use the prescriptions illustrated in the fundamental paper of Ref. [?]. We assign the symbol 1 to all the
marked values and the symbol 0 to the others. Then we convert the time series into a set of random walks with the
method detailed in Ref. [?]. We use the walking rule that the walker makes a step ahead by the quantity 1 any time
we meet a marked value, namely, the symbol 1. When we meet the symbol 0 the walker remains in a state of rest.
The marked values are much less numerous than the unmarked values. Thus, we shall have many 0's and a few 1's.
A sequence of many 0's between two 1's is termed a laminar region. It is important to evaluate the time distance
between one marked value and the next, which is equivalent to determining the distribution of the time lengths of
the laminar regions. This will result in a waiting time distribution $\psi(t)$. Let us assume that $\psi(t)$
is an inverse power law with index $\mu$. The Poisson case corresponds to $\mu = \infty$. Then we study the resulting
diffusion process and we establish its scaling parameter $\delta$. If the index $\mu$ fits the condition $2 < \mu < 3$
and is related to the scaling coefficient by the relation $\delta = 1/(\mu - 1)$, or if the index fits the condition
$\mu > 3$ and the scaling coefficient has the ordinary value $\delta = 0.5$, the labelled values are crucial events.
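
As a rough illustration of this criterion, the sketch below extracts the laminar-region lengths from a 0/1 sequence and checks a fitted pair ($\mu$, $\delta$) against the two relations just stated. The function names and the tolerance `tol` are assumptions introduced here for illustration only; the fitting of $\mu$ and $\delta$ themselves is not shown.

```python
def waiting_times(marks):
    """Lengths of the laminar regions: distances between consecutive 1's."""
    positions = [i for i, m in enumerate(marks) if m == 1]
    return [b - a for a, b in zip(positions, positions[1:])]


def expected_delta(mu):
    """Scaling expected for crucial events, following the relations in the text."""
    if mu > 3:
        return 0.5
    if 2 < mu < 3:
        return 1.0 / (mu - 1.0)
    raise ValueError("the relations are stated only for 2 < mu < 3 and mu > 3")


def looks_crucial(mu, delta, tol=0.05):
    """True if the fitted delta agrees with the expected one within tol (tol is an assumption)."""
    return abs(delta - expected_delta(mu)) <= tol
```

For instance, `looks_crucial(2.1, 0.91)` is true, while `looks_crucial(3.8, 0.73)` is false, since $\mu > 3$ would require $\delta = 0.5$.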

Let us explain the physical motivation for this definition. First, meeting a 0 is not significant because the 0's are
closely correlated, and meeting one of them implies that we shall meet many other 0's before finding a 1 again.
In a sense the 0's are driven by the 1's. A given 1 is the beginning of a laminar region and, consequently, of a cascade
of 0's. On the other hand, if $\delta$ and $\mu$ violate the relations for the 1's to be crucial, there might exist a
correlation between two different laminar regions. This suggests that the 1's might be part of a predictable cascade of
some other genuinely crucial events, not yet revealed by the statistical analysis, and that our conjecture is not correct.

We have to explain why $\mu < \infty$ is an important condition for our definition of crucial events. We limit ourselves
to noticing [?] that $\mu = \infty$ is equivalent to setting Poisson statistics, implying, in turn, a memoryless condition
that seems to be incompatible with the labelled values being crucial. This is a poorly understood property, in spite of
the fact that 32 years ago Bedeaux, Lindenberg and Shuler wrote a clarifying paper on this subject. We refer the
reader to this fundamental paper [?] to understand the reasons why we imagine the crucial events to be incompatible
with Poisson statistics. Actually, the labelled values seem to become crucial when the diffusion process they generate
is anomalous, namely $\delta > 0.5$, which implies $\mu < 3$. It has to be pointed out that $\mu < 2$ implies a condition
where the mean length of the laminar region is infinite. This means that in this region the process becomes
non-stationary, due to the lack of an invariant measure [8]. From an intuitive point of view, the origin of
non-stationary behavior is as follows. If we keep drawing random numbers from the distribution $\psi(t)$, the mean value
of t tends to increase with the number of drawings. This is so because, if it did not increase, the overall mean value
would be finite, in conflict with the fact that $\mu < 2$ produces an infinite mean value. Thus the condition $\mu = 2$ is
the border between the stationary condition, $\mu > 2$, and the non-stationary condition, $\mu < 2$. As we shall see,
this border has an interesting meaning for linguistics.

It would be an interesting issue to assess whether crucial events can be located in this region. We limit ourselves to
noticing that the results of the statistical analysis of time series seem to indicate that complex systems are
characterized by values of $\mu$ very close to the border, without ever entering the non-stationary domain $\mu < 2$.
In conclusion, with the prescription of Ref. [?] we are in a position to assess whether or not the marked values
correspond to crucial events.

B. The meaning of Zipf's law in linguistics 

Semiotics studies linguistic signs and their meanings, and identifies the relations between sign and meaning, and among
signs. The relations among signs (letters, words) are divided into two large groups, namely the syntagmatic and the
paradigmatic, corresponding to what are called Saussure's dimensions [9]. A clear-cut definition of the two dimensions
is outside the scope of this paper, but one can grasp an understanding of them by looking at Fig. 1. The abscissa axis
represents the syntagmatic dimension, while the ordinate axis represents the paradigmatic one. Along the abscissa
certain grammatical rules pose constraints on how words follow each other. We can say that this dimension is
temporal, with a causal order. An article (such as "a" or "the"), for instance, may be followed by an adjective or a noun,
but not by a verb of finite form. At a larger "time-scale", pragmatic constraints rule the succession of concepts, to
give a logic to the discourse. The other axis, on the other hand, refers to a "mental" space. The speaker has in
mind a repertoire of words, divided into many categories, which can be hierarchically complex and refer to syntactical
or semantic "interchangeability". Different space-scales of word paradigms can be associated with different levels of this
hierarchy. After an article, to follow the preceding example, one can choose, at a syntactical level, among all nouns of
a dictionary. However, at a deeper level, semantic constraints reduce the available words to be chosen. For instance,
after "a dog" one can choose any verb, but in practice only among verbs selected by semantic constraints (a dog runs
or sits, but does not read or smoke). The sentence "a dog graduates", for instance, fits the paradigmatic and syntagmatic
rules behind Fig. 1, but the semantics would in general forbid the production of such a "nonsensical" sentence.

The two dimensions are therefore not quite orthogonal, and connect at a cognitive level. The main focus of this
paper is to show that this connection, which is well known at short scales, is in fact reproduced at all scales. We will
also show that both dimensions are scale free and that the very complexity of linguistic structures in both dimensions
can be taken into account in a unified model, which is able to explain most statistical features of human language,
including, at the largest scales, the celebrated Zipf's law [10].




FIG. 1: Saussure's dimensions (horizontal axis: syntagmatic axis). In this example the first position in the syntagmatic
axis is an article, the second a noun and the third a verb in the third person. Notice the semantic restriction on the
paradigmatic possibilities.



Zipf's law, which relates the rank r of a word to its frequency f in a corpus, obviously points to constraints on the
frequencies of words. Remarkably, this does not mean that the probability of a word is actually defined. In fact, a
word may have a small or large frequency depending on the genre of the corpus (i.e. a large collection of written text)
under study, and even two extremely large corpora of the same type fail to reproduce the same word frequencies. It
is however remarkable that the occurrence of words is such that for any corpus and for any natural language the same
universal property emerges, this being that there are only a few frequent words and there are, on the contrary, many
words that are encountered, for instance, only once or twice. To be more precise, let us define the word rank r, which is
a property depending on the corpus adopted, as follows. One assigns rank 1 to the most frequent word, rank 2 to the
second most frequent one, and so on. Each word is uniquely associated with a rank and, although this number varies from
corpus to corpus, one always finds that

$$ f \propto \frac{1}{r}. \qquad (1) $$
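
The ranking procedure just described can be sketched in a few lines of Python. The function name `rank_frequency` and the alphabetical tie-breaking rule are assumptions made here so that each word receives a unique rank; the text does not specify how ties between equally frequent words are resolved.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, rank 1 being the most frequent word."""
    counts = Counter(tokens)
    # Sort by decreasing frequency; ties are broken alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ordered, start=1)]
```

On a real corpus one would then plot the frequency against the rank on doubly logarithmic axes and inspect the slope.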



We shall derive this law from a complexity model of language, along the following lines. We recognize that there
is no experimental linguistic evidence that word frequencies tend to well-defined probabilities. However, we adopt a
probabilistic approach, namely, we assume that there exists, for a randomly selected word, the probability of having the
frequency f, denoted as P(f). Operationally, we measure P(f) by counting how many words have a certain frequency
f. In Section III we show that a P(f) compatible with Zipf's law can be directly derived from the model proposed
herein, thus providing solid experimental support for our model. In order to fulfill our program, we shall assume a
long-range statistical mutual independence for the occurrence of different concepts. This hypothesis is appealing, since
it means that each occurrence of a concept makes entropy increase, and therefore we identify
the mathematical information (i.e. entropy) with the common-sense information (i.e. the occurrence of concepts).
Unfortunately, a concept is not, a priori, a well defined quantity. Herein we shall assume that concepts are represented
by words (or better by lemmata) or by groups of semantically similar words or lemmata [25].
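
The operational measurement of P(f) amounts to a histogram of word counts. A minimal sketch follows, where `frequency_spectrum` is a hypothetical helper name and f is taken to be the raw number of occurrences of a word in the corpus (an assumption; a relative frequency would only rescale the horizontal axis).

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency f to the fraction of distinct words occurring exactly f times."""
    word_counts = Counter(tokens)               # word -> number of occurrences
    n_words = len(word_counts)                  # number of distinct words
    spectrum = Counter(word_counts.values())    # f -> number of words with that f
    return {f: n / n_words for f, n in sorted(spectrum.items())}
```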

Because of the mutual independence among different concepts, we can extract from a single corpus as many
"experiments" as the number of concepts. For each experiment we select only one concept and we mark the occurrences
of the selected word or group of words corresponding to this concept. We shall use an advanced time-series analysis to
study the statistical properties of the dynamical occurrence of the markers. In fact, as mentioned in the first part of
this introduction, we adopt the DE method [?], which has been proved by earlier investigations [?] to be an efficient
way to detect the statistics of crucial events. We shall discover the important result that the adoption of the DE method
allows us to assess whether a given set of "markers" can be identified with the crucial events, or whether the crucial
events are still unknown and we must consequently search for a different set of markers.

The results of the experiments will lead to a language model reproducing the language "complexity", namely a
combination of order and randomness at all scales. The overall dynamics, given by the flow of concepts over time,
resembles the dynamics of intermittent dynamical systems, like the Manneville map [11]. These systems have long
periods of quiescence followed by bursts of activity. This variability of waiting times between markers of activity is
responsible for long-range correlations.

A further aspect of the paper is the connection between space and time complexity, and its application to linguistics.
We shall assume that atomic concepts exist and represent nodes of a complex network, connected by arcs representing,
when existing, semantic associations between one concept and another. We assume that our markers are actually defined
as a group of neighboring nodes. We then assume that language is the result of a random walk, namely a random
"associative" travel from concept to concept. The scale-free properties of this network, independently measured by
several research groups, including ours, provide a bridge to understand the intermittent dynamics described earlier. In
this unified model the network is a representation of Saussure's paradigms, whose complexity mirrors the syntagmatic
one in the asymptotic limit.



II. DE AND CONCEPTS 



Several papers have recently proposed a new way of detecting long-range effects of non-Poisson statistics
[?, 5, ?, 13, ?]. In short, as mentioned in the first part of Section I, we define a set of "markers" among the symbols or
values of a time series, and we evaluate numerically the probability of having a number x of markers in a window of
length t, denoted as $p(x;t)$. This statistical analysis is done by moving a window of length t along the sequence and
counting how many times one finds x markers inside this window. Finally, $p(x;t)$ is obtained by dividing this number
by the total number of windows of size t, namely by $N - t + 1$, where N is the total length of the sequence. Since
we typically deal with large values of both x and t, we legitimately adopt a continuous approximation. Moreover, we
make the ergodic and stationary assumption, thereby expecting the emergence at long times of the scaling condition
expressed by



$$ p(x;t) = \frac{1}{t^{\delta}} F\left( \frac{x - wt}{t^{\delta}} \right), \qquad (2) $$

where w is the overall probability of observing a marker, $\delta$ is the scaling index and F is a function, sometimes
called the "master curve". If F is the Gauss function, $\delta$ is the known Hurst index, and if the further condition
$\delta = 0.5$ is obeyed, then the process is Poisson and the variable x undergoes ordinary Brownian diffusion. According
to Ref. [3], in this case the occurrence of markers in time is regulated by random fluctuations, and the diffusion process
occurs with no memory. It is straightforward to show that the Shannon information



$$ S(t) = -\int dx\, p(x;t) \ln p(x;t) \qquad (3) $$

with condition (2) leads to

$$ S(t) = k + \delta \ln t, \qquad (4) $$

where k is a constant. The evaluation of the slope according to which S increases with $\ln t$ therefore provides a
measure of the anomalous scaling index $\delta$.
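
The procedure of Eqs. (2)-(4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the integral of Eq. (3) is replaced by a discrete sum over the observed values of x, the slope is obtained by an ordinary least-squares fit over all the window lengths supplied by the caller, and the function names are hypothetical.

```python
import math
from collections import Counter


def diffusion_entropy(marks, t):
    """Shannon entropy of the number of markers x found in the N - t + 1 windows of length t."""
    n = len(marks)
    if not 1 <= t <= n:
        raise ValueError("window length must satisfy 1 <= t <= len(marks)")
    x = sum(marks[:t])                  # markers in the first window
    histogram = Counter([x])
    for start in range(1, n - t + 1):
        # Slide the window by one step: add the entering value, drop the leaving one.
        x += marks[start + t - 1] - marks[start - 1]
        histogram[x] += 1
    total = n - t + 1
    return -sum((c / total) * math.log(c / total) for c in histogram.values())


def fit_delta(marks, window_lengths):
    """Least-squares slope of S(t) against ln t, an estimate of delta in Eq. (4)."""
    xs = [math.log(t) for t in window_lengths]
    ys = [diffusion_entropy(marks, t) for t in window_lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    den = sum((a - mean_x) ** 2 for a in xs)
    return num / den
```

For a memoryless sequence of independent markers the fitted slope should come out close to 0.5, which gives a simple sanity check before the method is applied to real texts. In practice the fit should be restricted to the long-time region, after any transient.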

Let us briefly mention what we know about applying DE to time series with non-Poisson statistics. We derive from
the time series under study an auxiliary sequence, by setting $\xi_i = 1$ (this means that we find the marker at the i-th
position), or $\xi_i = 0$ (the i-th symbol, or value, is not a marker). We then assume "informativity" for the markers.
As explained in the first part of Section I, this means that we assume the markers to be crucial events and that,
consequently, the distance between a given 1 and the next does not depend on the distance between this given 1
and the preceding one. As mentioned earlier, if the lengths t of the laminar regions are distributed as

$$ \psi(t) = (\mu - 1) \frac{T^{\mu-1}}{(T + t)^{\mu}} \qquad (5) $$

($\psi(t) \sim t^{-\mu}$ asymptotically is a sufficient condition), then the theory based on the continuous-time random
walk (CTRW) and the generalized central limit theorem (GCLT) yields for $p(x;t)$ a truncated Lévy probability
distribution function (PDF) [5]. DE detects the approximate scaling $\delta$ of the central part, namely

$$ \delta = \frac{1}{\mu - 1} \ \ \text{if} \ \ 2 < \mu < 3, \qquad \delta = 0.5 \ \ \text{if} \ \ \mu > 3. \qquad (6) $$

We note that this PDF is abruptly truncated by a ballistic peak at $x = 0$, representing the probability of finding no
marker up to time t. However, the diffusion entropy increase is essentially determined by the occurrence of crucial
events, thereby leading to the scaling of Eq. (6).
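
To experiment with the relation of Eq. (6), one can generate a synthetic marker sequence whose laminar regions follow Eq. (5). The sketch below uses inverse-transform sampling: the survival probability of Eq. (5) is $(T/(T+t))^{\mu-1}$, and setting it equal to a uniform random number u gives $t = T(u^{-1/(\mu-1)} - 1)$. Rounding each waiting time up to at least one step and capping it at `max_gap` are assumptions made here to obtain a finite discrete sequence; the function names are hypothetical.

```python
import math
import random


def waiting_time_from_uniform(u, mu, T):
    """Solve (T / (T + t))**(mu - 1) = u for t, with 0 < u <= 1 and mu > 1."""
    return T * (u ** (-1.0 / (mu - 1.0)) - 1.0)


def synthetic_markers(n_events, mu, T, seed=0, max_gap=10**6):
    """Build a 0/1 sequence with n_events laminar regions drawn from Eq. (5)."""
    rng = random.Random(seed)
    marks = [1]
    for _ in range(n_events):
        u = 1.0 - rng.random()                      # uniform in (0, 1], never zero
        gap = math.ceil(waiting_time_from_uniform(u, mu, T))
        gap = min(max_gap, max(1, gap))             # discretization and cap (assumptions)
        marks.extend([0] * (gap - 1))               # laminar region of 0's
        marks.append(1)                             # next marker
    return marks
```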

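It is worth explaining why the condition $2 < \mu < 3$ means long-range memory. The first, and probably easier, way
is based on observing that the truncated Lévy probability distribution $p(x;t)$ yields

$$ \langle x^2(t) \rangle - \langle x(t) \rangle^2 \propto t^{4-\mu}, \qquad (7) $$

thereby leading, as shown in Ref. [?], to the correlation function

$$ \frac{\langle (\xi(t_1) - \langle \xi \rangle)(\xi(t_2) - \langle \xi \rangle) \rangle}{\langle \xi^2 \rangle} \propto \frac{1}{|t_2 - t_1|^{\mu-2}}. \qquad (8) $$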

Note that the memory stemming from the non-Poisson nature of the markers corresponds in this case to a non-integrable
correlation function, which is therefore a clear signal of infinite memory. In the Poissonian case this correlation
function would be exponential, thereby indicating a much shorter memory.

There exists a second, and more impressive, way of relating the non-Poisson distribution of waiting times to memory,
according to the first part of Section I. We are observing the fluctuations of a dichotomous variable $\xi$, with $\xi = 1$
implying the occurrence of a crucial event, and $\xi = 0$ the non-occurrence of events. If the role of time is played by
the ordinal number of words, we shall show that a good choice of semantic markers makes Eq. (5) become a good
model for the dynamics of concepts in natural language. The occurrence of Poisson statistics would mean, according to
Ref. [?], that the time evolution of markers is driven by fluctuations that are identical to white noise. In conclusion,
Poisson statistics would be totally incompatible with the existence of any form of memory. The exponential decay
of the correlation function of Eq. (8) would not be a short-time memory, but it would signal the complete lack of
memory. The long-range correlation corresponding to $\mu < 3$, on the contrary, is a clear manifestation of the infinitely
extended memory of a text, associated with the strikingly non-Poisson character of the waiting time distribution of
Eq. (5).

Eq. (6) rests on uncorrelated waiting times between events. This means that if two markers are separated by
intervals of words of duration $\tau_k$ (the distance in words between the k-th and the (k+1)-th occurrence of the marker)
then $\langle \tau_i \tau_j \rangle \propto \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta. Under these conditions
each event carries the same amount of information. The statistical independence of different laminar regions means that
the information carried by the events is maximal for a given waiting time distribution $\psi(\tau)$. In linguistic jargon,
we can say that if in a corpus we find a marker (e.g. a list of words) such that $\delta \approx 1/(\mu - 1)$, then this
marker is informative in that corpus. If the markers are not informative, this important condition is violated. We shall
see below that certain markers, e.g. punctuation marks, are not real events, but are rather modeled by a Copying Mistake
Map (CMM) [16]. This means that the complex dynamics of discourse are such that, although the punctuation marks carry
long-range correlations, and consequently the anomalous scaling of $p(x;t)$, the waiting times in the laminar regions
determined by these markers are correlated. Punctuation marks are not informative. Their complexity is just a projection
of a complexity carried by "concept dynamics", namely, by the really crucial events.



A. The CMM and non-informative markers 

The Copying Mistake Map (CMM) is a model originally introduced to study the anomalous statistics of
nucleotide dispersion in coding and non-coding DNA regions. The CMM is a sequence that can be derived from
the long-range sequence generated by the waiting time distribution of Eq. (5), and corresponding to the long-range
correlation function of Eq. (8), as follows. We define the probabilities $p(1) = \epsilon$ and $p(2) = 1 - \epsilon$. For
any site of the sequence, with label i, we leave the value $\xi_i$ unchanged with probability p(1) and, with probability
p(2), we assign to it a totally random value (copying mistake).
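
The construction above can be sketched as follows. The name `copying_mistake_map` is hypothetical, and the "totally random value" is taken here to be 0 or 1 with equal probability, which is an assumption: the text does not state the bias of the random symbol.

```python
import random


def copying_mistake_map(original, eps, rng=None):
    """Keep each site with probability p(1) = eps, otherwise replace it by a random 0/1."""
    rng = rng or random.Random()
    copied = []
    for xi in original:
        if rng.random() < eps:
            copied.append(xi)                   # faithful copy, probability eps
        else:
            copied.append(rng.randint(0, 1))    # copying mistake (fair coin assumed)
    return copied
```

With `eps = 1` the sequence is copied unchanged, while with `eps = 0` every site is random and all memory of the original sequence is lost.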

The resulting waiting time distribution undergoes an exponential decay. This is so because the occurrence of a
given 1 depends on two distinct possibilities: it can be an "original" 1, or an original 0 turned into 1 by a copying
mistake. Formally, we write



$$ \psi_{\rm exp}(t) = \psi_{\rm rand}(t)\, \Psi_{\rm corr}(t) + \Psi_{\rm rand}(t)\, \psi_{\rm corr}(t), \qquad (9) $$

where $\Psi(t) = \int_t^{\infty} dt'\, \psi(t')$ and $\Psi_{\rm rand}(t) = \int_t^{\infty} dt'\, \psi_{\rm rand}(t')$, and



$$ \psi_{\rm rand}(t) = \ln\left( \cdots \right) \cdot \left( \cdots \right). \qquad (10) $$

We note that $\psi_{\rm rand}(t)$ and, consequently, $\Psi_{\rm rand}(t)$ undergo an exponential decay. So does, in the
asymptotic limit, the function $\psi_{\rm exp}(t)$. What about the DE curve? The theory predicts [?] a random
($\delta = 0.5$) behavior for short times, a knee, and a slow transient to the totally correlated behavior. An example of
CMM is given by punctuation marks in natural language. We choose punctuation marks as markers for an Italian corpus of
newspapers and magazines, more than 300,000 words long, called the Italian Treebank (hereafter TB). We assign the value 0
to any word that is not a punctuation mark, and 1 when it is, namely to full stops, commas, etc. A sentence like
"Felix, the cat, sleeps!" is therefore transformed into "0100101".
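
The encoding of the example sentence can be reproduced with a short sketch. The tokenization rule used here (runs of word characters are words, every other non-blank character is a punctuation mark) is an assumption made for illustration, and the function name is hypothetical; a real corpus such as TB comes already tokenized.

```python
import re


def punctuation_markers(sentence):
    """Return a 0/1 string: 0 for each word, 1 for each punctuation mark."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return "".join("0" if re.match(r"\w", token) else "1" for token in tokens)


print(punctuation_markers("Felix, the cat, sleeps!"))  # prints 0100101
```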

Figs. 2(a), 2(b) and 2(c) show that the choice of these markers yields a time series with the typical properties of a CMM
sequence. This means that the waiting times $\tau_k$ are correlated. According to Section I A this means that punctuation
marks are not crucial events. Notice however that the use of the DE method detects an anomalous scaling parameter
$\delta$. This means that there are hidden crucial events and that they generate an infinitely extended memory.



B. Concepts as informative markers 



More experiments, not reported here, show that the CMM behavior is typical of many characters, and is shared
by all the letters of the alphabet, with $\delta \approx 0.6$. Obviously, if the marker is a given letter, the role of time
is played by the ordinal number of typographical characters in the text. We note that in Italian the alphabetic
characters mirror phonetics. Thus, it is of linguistic interest to move from a phonetic to a morphosyntactic level. To do
so, a text has to be lemmatized and tagged with respect to its parts of speech. After this procedure we can identify as a
marker the occurrence of a certain part of speech (e.g. article, adverb, adjective, verb, noun, preposition, numeral,
punctuation, etc.). For instance, the sentence "Felix, the cat, sleeps!" is now transformed into "NPRNPVP", where N, P,
R and V stand for noun, punctuation, article and verb. If we select verbs as markers, we have "0000010". Fig. 3
shows the result of this experiment for verbs and for numerals. We notice that both cases lead to the same asymptotic
behavior of the DE, but to completely different behavior of the waiting time distribution $\psi(t)$ (where t is the number
of words between markers). We notice that DE reveals a long-time memory in both cases. As to $\psi(t)$, it shows an
exponential truncation at long times, in both cases. However, in the case of numerals we find an extended transient
with a slope $\mu_{\rm numerals} < 2$, corresponding to the non-stationary condition pointed out in the first part of the
introduction. This is in fact due to the uneven distribution of numerals in the corpus, since they are encountered more
often in the economic part of the Italian newspapers. As incomplete as this analysis is, it reveals that numerals carry
more information than phonetic markers, or even than verbs used as markers. This is linguistically interesting,
since numerals denote a part of speech, but also a "semantic class".
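
Once a text is tagged, the marker sequence for a chosen part of speech follows directly from the tag string. A minimal sketch, using the single-letter tags of the example above (the helper name `tag_markers` is hypothetical):

```python
def tag_markers(tags, target):
    """Return a 0/1 string: 1 where the part-of-speech tag equals target, 0 elsewhere."""
    return "".join("1" if tag == target else "0" for tag in tags)


print(tag_markers("NPRNPVP", "V"))  # prints 0000010
```

Selecting "P" instead of "V" on the same tag string gives back the punctuation sequence "0100101".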

We are therefore led to suppose that informative markers are the ones associated with a semantically coherent class
of words. This is however a problem, since every single concept is too rare in a balanced corpus (a long text with a
variety of genres). The next level of our exploratory search for events is therefore to look for the occurrence of
"salient words" in a specialist text. Such a corpus has been made available as the Italian corpus of the European
project POESIA [17]. POESIA is a European Union funded project whose aim is to protect children from offensive
web contents, e.g. pornography in WWW URLs. Salient "pornographic" words were automatically extracted
by comparing their frequency in an offensive corpus to the frequency of their occurrence in the balanced TB corpus
used earlier. The definition adopted was

$$ s(l) = \frac{\cdots}{f_{EC}(l) + f_{TB}(l)}, \qquad (11) $$

where $f_{EC}(l)$ is the frequency, in the erotic corpus, of the lemma l, and $f_{TB}(l)$ is the same property in the
reference Italian corpus (Italian Treebank). Salient lemmata were automatically chosen as the 5% with the highest value
of s. Notice that in this experiment the "dirty" words are not taken into consideration, because they do not appear in the
reference corpus: for them s = 1, but s cannot in effect be properly defined, especially for extremely rare words.
However, an offensive metaphoric use of terms is in fact detected, leading to a completely new approach to automatic text
categorization and filtering [18], using a method, based on DE analysis, called CASSANDRA [?].

Salient words were therefore used as markers for our analysis, as described earlier. The results are shown in Fig.
4, clearly showing that, in a specialized corpus, salient words of this genre pass the test of informativeness. Salient
words, and plausibly words in general, are therefore distributed like markers generated by an intermittent dynamical
model, with $\mu \approx 2.1$ and, in agreement with Eq. (6), $\delta \approx 1/(\mu - 1) = 0.91$. We see in the next
section how this behavior is plausibly connected with a topological complexity at the paradigmatic level, and in
Section III we derive Zipf's law from the resulting model.

FIG. 2: a) Diffusion entropy for punctuation marks. The fit for the asymptotic limit (solid line) yields $\delta = 0.7$.
The dashed line marks a transient regime with $\delta = 1$. b) Non-normalized distribution of waiting times for
punctuation marks, namely counts of waiting times of length t between marks in TB. The expression for the dashed line
fit is $7000 \cdot \exp(-t/7.15)$. c) Second moment analysis for punctuation marks. The expression for the solid line fit
is $0.06 \cdot t^{1.57}$. Notice that $1.57 \approx 3 - 1/0.7$, namely the expression $H = 3 - 1/\delta$ of Ref. [?] for
Lévy processes stemming from CMM's is verified.



III. SCALE-FREE NETWORKS, INTERMITTENCY AND THE ZIPF'S LAW 

In this section we outline a cognitive model that serves the purpose of connecting structure and dynamics. Allegrini
et al. [20] identified semantic classes in the Italian corpus, by looking at paradigmatic properties of interchangeability
of classes of verbs with classes of nouns. They defined "super-classes" of verbs and nouns as "substitutability islands",
namely groups of nouns and verbs sharing the property that in the corpus each verb of the class is found co-occurring,
in a context, with each noun of the class [20]. This is precisely a direct application of the notion of "paradigm". Let
us call $p_v(c)$ and $p_n(c)$, respectively, the number of verbs or nouns belonging to a number c of classes. They found
that

$$ p_n(c) \propto \frac{1}{c^{1+\eta}}, \qquad (12) $$

where $\eta$ is a number whose absolute value is (much) smaller than 1.

FIG. 3: a) DE for verbs (squares) and numerals (circles). The dashed line is a fit for the verbs, with $\delta = 0.73$,
while the solid line is a fit for the numerals, with $\delta = 0.82$. b) $\psi(t)$ for verbs (black circles) and numerals
(white circles). The dashed line is a fit for the numerals, with $\mu = 1.2$, while the solid line is a fit for the verbs,
with $\mu = 3.8$.

FIG. 4: a) DE for salient "erotic" words for a corpus of erotic stories and offensive web pages (squares), and for the
Italian reference corpus (circles). The solid line is a fit with expression $S(t) = k + \delta \ln(t + t_0)$, where the
additional parameter $t_0$ is added to the original Eq. (4) to take transients into account and to improve the quality of
the fit, yielding $\delta = 0.91$. b) Non-normalized waiting time distribution for salient "erotic" words for a corpus of
erotic stories. The expression for the dashed line fit is $14000 \cdot (12.0 + t)^{-2.1}$, yielding $\mu = 2.1$. In both
panels the horizontal axis is t.

Working along the same lines, other authors [21] found a "small world" topology [22], by looking at the number of
synonyms in an English thesaurus, for each English lemma. We therefore assume that this kind of structure is general
for any language. Let us therefore imagine that the paradigmatic structure of concepts is a scale-free network and
consider a random walk in this "cognitive space". Let us make the following assumptions:

• The statistical weight of the $i$-th node is $\omega_i \sim c_i$.

• The dynamics of the system is ergodic, so that the characteristic recurrence time is $\tau_i \sim c_i^{-1}$.

• The above properties hold for all nodes, with the same functional form for the recurrence time distribution, e.g.

$$\psi_i(t) = (1/\tau_i) \exp(-t/\tau_i)$$

Now we imagine that selecting a concept means selecting a few neighbouring nodes. This collection of nodes, due to
the scale-free hypothesis, shares the same scaling properties as the complete scale-free network, namely $p(c) \sim c^{-(1+\eta)}$.
Therefore we have that

$$\psi(t) \sim \int dc \, c^2 e^{-ct} \frac{1}{c^{1+\eta}} \sim \frac{1}{t^{2-\eta}}. \qquad (13)$$

Thus, we recover the intermittent model introduced earlier, with $\mu = 2 - \eta$ close to 2.
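
The scaling of the recurrence-time distribution can be checked numerically. The following is a minimal sketch, assuming a connectivity distribution $p(c) \propto c^{-(1+\eta)}$, a node weight proportional to $c$, and an exponential recurrence law with rate proportional to $c$, as in the assumptions listed above. The function names `psi` and `fitted_mu`, the cutoffs and the grid size are illustrative choices, not taken from the original analysis.

```python
import numpy as np

def psi(t, eta, c_min=1e-8, c_max=1e3, n=20001):
    """Unnormalized waiting-time density obtained by mixing exponentials.

    Integrates p(c) * weight(c) * rate(c) * exp(-rate(c) * t) over c, with
    p(c) ~ c^-(1+eta), weight ~ c and rate = 1/tau ~ c (assumed forms).
    """
    u = np.linspace(np.log(c_min), np.log(c_max), n)  # uniform grid in u = ln c
    c = np.exp(u)
    # Integrand in the variable u: the last factor c comes from dc = c du.
    y = c ** (-(1.0 + eta)) * c * c * np.exp(-c * t) * c
    # Trapezoidal rule written out explicitly.
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(u))

def fitted_mu(eta, t_values=None):
    """Estimate mu from the log-log slope of psi(t) over the given times."""
    if t_values is None:
        t_values = np.logspace(0, 3, 30)  # same range as the figures, 1 to 1000
    values = np.array([psi(t, eta) for t in t_values])
    slope, _ = np.polyfit(np.log(t_values), np.log(values), 1)
    return -slope

if __name__ == "__main__":
    for eta in (-0.1, 0.0, 0.1):
        print(f"eta = {eta:+.1f}  fitted mu = {fitted_mu(eta):.3f}")
```

Under these assumptions the fitted exponent should stay close to $2 - \eta$, that is, near the border value $\mu = 2$ for small $|\eta|$.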

At this stage, deriving Zipf's law $f \propto r^{-a}$, with $a$ close to unity, becomes a simple exercise. Let us define the
frequency probability $P(f)$ through

$$P(f)\,df = \mathrm{prob}(r)\,dr \;\Rightarrow\; P(f) \sim f^{-(1 + 1/a)}. \qquad (14)$$
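
The change of variables from rank to frequency can be spelled out as follows. This is a short sketch, assuming that each rank is occupied by exactly one word, so that $\mathrm{prob}(r)$ is uniform in $r$:

```latex
f \propto r^{-a}
\;\Rightarrow\;
r \propto f^{-1/a},
\qquad
P(f) \propto \left| \frac{dr}{df} \right| \propto f^{-1/a - 1} = f^{-(1 + 1/a)} .
```

For $a = 1$ this gives $P(f) \sim f^{-2}$, and conversely a tail $P(f) \sim f^{-2}$ corresponds to $a = 1$.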

Next, let us notice that $P(f)$ must be a stable distribution. In fact, Zipf's law is valid for every corpus. In
particular, if it is valid for corpus A and for corpus B, it is also valid for the corpus A + B, where + means the
concatenation of corpora. If we keep going with this concatenation process, we obtain a corpus

Total Corpus = Corpus A + Corpus B + ... . 

Thus, the frequency of a word in the total corpus, $f_{tot}$, is written in terms of the single frequencies $f_1, f_2, \ldots$, and of
the total lengths $N_1, N_2, \ldots$ of the single corpora, as follows

$$f_{tot} = \frac{f_1 + f_2 + \cdots}{N_1 + N_2 + \cdots} = \frac{1}{\sum_i N_i} \sum_i f_i, \qquad (15)$$



i.e., the Generalized Central Limit Theorem [23] applies. This means that the probability of frequency $P(f)$ is a
Lévy $\alpha$-stable distribution. This probability of finding $f$ occurrences of a word in a corpus of a given length can be
identified with $p(x; t)$ of Section II, if we take into proper consideration the length $t$ as a parameter. We have earlier
noticed that $p(x; t)$ in language is a Lévy process, with $\delta \approx 1$, and therefore with a tail $P(f) \sim f^{-2}$. In other words,
through Eq. (14) we recover $a \approx 1$, i.e. Zipf's law.
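
The correspondence between the tail of $P(f)$ and the Zipf exponent can be illustrated with synthetic data. The following is a minimal sketch, assuming frequencies drawn independently from a pure power-law (Pareto) density $P(f) \sim f^{-(1+\alpha)}$ rather than from a real corpus. The function name `zipf_exponent_from_tail`, the sample size and the fitting window are illustrative choices.

```python
import numpy as np

def zipf_exponent_from_tail(tail_exponent, n_words=100_000, seed=0):
    """Estimate the rank-frequency exponent a for frequencies with P(f) ~ f^-tail_exponent.

    Frequencies are synthetic: drawn by inverse-transform sampling from a
    Pareto law with f >= 1 (an assumption made only for this illustration).
    """
    rng = np.random.default_rng(seed)
    alpha = tail_exponent - 1.0              # P(f) ~ f^-(1 + alpha)
    u = rng.random(n_words)                  # uniform in [0, 1)
    freqs = (1.0 - u) ** (-1.0 / alpha)      # Pareto-distributed "frequencies"
    freqs = np.sort(freqs)[::-1]             # rank 1 = most frequent word
    ranks = np.arange(1, n_words + 1)
    # Fit on intermediate ranks: the very top is noisy, the bottom hits the cutoff f = 1.
    window = slice(99, n_words // 10)
    slope, _ = np.polyfit(np.log(ranks[window]), np.log(freqs[window]), 1)
    return -slope

if __name__ == "__main__":
    print("tail f^-2 gives a =", round(zipf_exponent_from_tail(2.0), 2))
```

With a tail exponent of 2 the estimated rank-frequency exponent should come out close to unity, in line with the argument above; other tail exponents $1 + \alpha$ should give $a \approx 1/\alpha$.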



IV. CONCLUSIONS 



We have identified the cognitive process governing human language, and proved it to be complex at both the syntagmatic
and the paradigmatic level. Although the study was conducted on Italian written corpora, we are inclined to believe that
the property found is language independent, general and important, as emerging from a decade of studies on
Zipf's law. Thus, we find that each concept corresponds to a scaling close to $\delta = 1$, this being consistent with Zipf's
law. The complexity of language is located at the border between the stationary ($\mu > 2$) and non-stationary ($\mu < 2$)
conditions [8]. As pointed out in Section I A, we adopt the perspective of complexity as a condition of transition from
dynamics to thermodynamics. Within this perspective, we notice that, although the condition $\mu > 2$ means aging [24]
and a transition from dynamics to thermodynamics in a virtually infinite time, a thermodynamic condition exists, and
the diffusion process tends to approach the scaling regime as $t \to \infty$. In the case $\mu < 2$, no thermodynamic condition
exists, and the stationary condition cannot be realized, not even ideally. We see that specialist texts are located
within the ergodic regime ($\delta = 0.91$ in the experiment illustrated in this paper). We are convinced that this is a sign
of the fact that language rests on a subtle balance between two opposite needs: learnability, the property
that allows a child to learn a language, and variability, namely the need to explore a virtually infinite cognitive
space. We therefore believe that the results of this paper might help to understand language complexity and its
evolution in children during the learning years, and in psychopathological subjects. We propose the use of this method in future
research that might lead to the understanding of certain mental diseases with no recourse to invasive diagnostic
methods.

We think that the results of this paper might also help in understanding the complexity of real living systems. We
have seen that a random walk through the network, under the scale-free hypothesis, generates memory and long-range
correlation. The speaker sets concepts in a temporal order, but this corresponds to complex relations in a non-linear
space. We argue that a similar mechanism may also be adopted by real living systems. Thus, the relational
interconnectivity of chemicals in the living cell may depend on the long-range memory of the cell function, or influence
it. From a Language Engineering point of view, this study provides a theoretical background for a completely new
strategy of automatic text categorization. A prototype is being implemented as a semantic filter [18]. We think that
the proposed test for the informativeness of a set of markers will be beneficial more generally for the analysis of time
series.

Finally, we want to stress that the method successfully adopted in this paper to establish the information content
of markers, or labeled values, of a time series might have applications to the war against terrorism. In
fact, if the intelligence community provides suggestions on the possible crucial words of a new wave of terrorist attacks,
the adoption of this method offers an automatic way to identify messages with a terrorist content, more or less
in the same way as illustrated here for filtering text with a pornographic content.

Acknowledgments Financial support from ARO, through Grant DAAD19-02-0037, is gratefully acknowledged.



[1] R. Ferrer i Cancho and R. V. Solé, Zipf's law and random texts, Adv. in Complex Systems 5, 1 (2002).
[2] D. R. Cox and P. A. Lewis, The Statistical Analysis of Series of Events, Methuen, London (1966).
[3] P. Allegrini, P. Grigolini, P. Hamilton, L. Palatella, and G. Raffaelli, Memory beyond memory in heart beating, a sign of
a healthy physiological condition, Phys. Rev. E 65, 041926 (1-5) (2002).
[4] M. S. Mega, P. Allegrini, P. Grigolini, V. Latora, L. Palatella, A. Rapisarda, and S. Vinciguerra, Phys. Rev. Lett. 90,
188501 (1-4) (2003).

[5] P. Grigolini, L. Palatella and G. Raffaelli, Fractals 9, 439 (2001).

[6] P. Allegrini, G. Aquino, P. Grigolini, L. Palatella, A. Rosa, cond-mat/0304506 

[7] D. Bedeaux, K. Lakatos Lindenberg, and K. E. Shuler, J. Math. Phys. 12, 2116 (1971). 

[8] M. Ignaccolo, P. Grigolini, A. Rosa, Sporadic randomness: The transition from the stationary to the nonstationary
condition, Phys. Rev. E 64, 026210 (1-11) (2001).
[9] K. Silverman, The Subject of Semiotics, Oxford University Press, New York (1983). 
[10] G. K. Zipf, The Psycho-Biology of Language (Houghton-Mifflin, 1935; MIT Press, 1965).

[11] P. Manneville, Intermittency, self-similarity and 1/f spectrum in dissipative dynamical systems, J. Physique vol. 41 (1980), 
1235-1243. 

[12] P. Allegrini, R. Balocchi, S. Chillemi, P. Grigolini, P. Hamilton, R. Maestri, L. Palatella, G. Raffaelli, Long- and short-term
analysis of heartbeat sequences: Correlation with mortality risk in congestive heart failure patients, Phys. Rev. E 67,
062901 (2003).

[13] N. Scafetta, P. Hamilton, P. Grigolini, The Thermodynamics of Social Process: the Teen Birth Phenomenon, Fractals 9,
193 (2001); N. Scafetta and P. Grigolini, Scaling detection in time series: diffusion entropy analysis, Phys. Rev. E
66, 036130 (2002).

[14] P. Allegrini, V. Benci, P. Grigolini, P. Hamilton, M. Ignaccolo, G. Menconi, L. Palatella, G. Raffaelli, N. Scafetta, M. 
Virgilio, J. Yang, Compression and Diffusion: a Joint Approach to Detect Complexity, Chaos Solitons & Fractals 15, 
517-535 (2003). 

[15] G. Trefan, E. Floriani, B. J. West and P. Grigolini, A Dynamical Approach to Anomalous Diffusion: Response of Levy
Processes to Perturbation, Phys. Rev. E 50, 2564 (1994).
[16] P. Allegrini, M. Barbi, P. Grigolini and B. J. West, Dynamical model for DNA sequences, Phys. Rev. E 52, 5281
(1995); P. Allegrini, M. Buiatti, P. Grigolini and B. J. West, Non-Gaussian statistics of anomalous diffusion: The DNA
sequences of prokaryotes, Phys. Rev. E 58, 3640 (1998).
[17] N. Scafetta, V. Latora, P. Grigolini, Levy statistics in coding and non-coding nucleotide sequences, Physics Letters A
299 (5-6), 565-570 (2002); N. Scafetta, V. Latora, P. Grigolini, Scaling without detrending: the diffusion entropy method
applied to the DNA sequences, Phys. Rev. E 66, 031906 (2002).
[18] Visit URL http://www.poesia-filter.org for all information about the Poesia Project (Public Open-source Environment
for a Safer Internet Access), European Project Number IAP 2117/27572 (2002), and the open-source Poesia filter.
[19] P. Allegrini, P. Grigolini, L. Palatella, G. Raffaelli, M. Virgilio, Facing non-stationarity Conditions with a New Indicator
of Entropy Increase: the CASSANDRA Algorithm, in: Novak, M.N. (ed.): Emergent Nature. World Scientific, Singapore
(2002) 173-184.

[20] P. Allegrini, S. Montemagni and V. Pirrelli, Extracting Word Classes from Data Types, COLING Proceedings, Saarbruecken 
2000; P. Allegrini, S. Montemagni, V. Pirrelli, Example-based Automatic Induction of Semantic Classes Through Entropic 
Scores, Rivista di Linguistica Computazionale, in press. 

[21] A.E. Motter, A.P.S. de Moura, Y.-C. Lai, and P. Dasgupta, Topology of the conceptual network of language, Phys. Rev. 
E 65, 065102 (2002). Rapid Communications. 

[22] D. J. Watts and S. H. Strogatz, Collective dynamics of 'small-world' networks, Nature 393 (1998) 440; D. J. Watts, Small
Worlds (Princeton University Press, Princeton NJ, 1999); A.-L. Barabasi, Linked: The New Science of Networks (Perseus
Publishing, Cambridge, MA).
[23] W. Feller, An introduction to probability theory and its applications, vol. 1, Wiley, New York, 1971.
[24] P. Allegrini, J. Bellazzini, G. Bramanti, M. Ignaccolo, P. Grigolini, and J. Yang, Scaling breakdown: A signature of aging,
Phys. Rev. E 66, 015101 (1-4) R (2002).
[25] A lemma is defined as a representative word of a class of words having different morphological features. For instance, the
word "dogs" has lemma "dog", and the word "sleeping" has lemma "sleep".



